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ABSTRACT 

I introduce a new code for fast calculation of the Lomb-Scargle periodogram, that leverages the 
computing power of graphics processing units (CPUs). After establishing a background to the newly 
emergent field of GPU computing, I discuss the code design and narrate key parts of its source. 
Benchmarking calculations indicate no significant differences in accuracy compared to an equivalent 
CPU-based code. However, the differences in performance are pronounced; running on a low-end 
GPU, the code can match 8 CPU cores, and on a high-end GPU it is faster by a factor approaching 
thirty. Applications of the code include analysis of long photometric time series obtained by ongoing 
satellite missions and upcoming ground-based monitoring facilities; and Monte-Carlo simulation of 
periodogram statistical properties. 

Subject headings: methods: data analysis — methods: numerical — techniques: photometric — stars: 
oscillations 



1. INTRODUCTION 

Astronomical time-series observations are often charac- 
terized by uneven temporal sampling (e.g., due to trans- 
formation to the heliocentric frame) and/or non-uniform 
coverage (e.g., from day/night cycles, or radiation belt 
passages). This complicates the search for periodic sig- 
nals, as a fast Fourier transform (FFT) algorithm can- 
not be employed. A variety of alternatives have been put 
forward, the most oft-used being the epon ymous Lomb - 
Sca rgle (L-S] periodogram developed by iLombl ()1976f ) 
and lScargltJ |i982). At the time of writing, NASA's As- 
trophysics Data System (ADS) lists 735 and 1,810 pub- 
lications (respectively) that cite these two papers, high- 
lighting how important the L-S periodogram has proven 
for the analysis of time series. Recent applications in- 
clude the search for a link between sola r rotation and 
nuclear decay rates ([Sturrock ct al."2010); the study of 
pulsar timing noise (JLyne et al. 2010); the c haracteriza- 
tion of quasi-periodic oscillations in blazars (|Rani et al.l 
l2010f ): and the m easurement of rotatio n periods in exo- 
planet host stars (jSimpson et al.|[20Tol) . 

Unfortunately, a drawback of the L-S periodogram is 
a computational cost scaling as 0{Nf), where Nt is the 
number of measurements in the time series; this contrasts 
with the far-more-efficient Oj Nt logg Nt) scaling of th e 
FFT algorithm popularized by lCoolev fc Tukevl (| 19651 ) . 
Oi ie approach to reducing this cost has been proposed 
bv lPress fc Rvbickl () 19891 ). based on constructing a uni- 
formly sampled approximation to the observations via 
'extirpolation' and then evaluating its FFT. The present 
paper introduces a different approach, not through algo- 
rithmic development but rather by leveraging the com- 
puting power of graphics processing units (CPUs) — the 
specialized hardware at the heart of the display sub- 
system in personal computers and workstations. Mod- 
ern GPUs typically comprise a number of identical pro- 
grammable processors, and in recent years there has been 
significant interest in applying these parallel-computing 
resources to problems across a breadth of scientific dis- 
ciplines. In the following section, I give a brief history 



of the newly emergent field of GPU computing; then. 
Section [3] reviews the formalism defining the L-S peri- 
odogram, and Section |4] presents a CPU-based code im- 
plementing this formalism. Benchmarking calculations 
to evaluate the accuracy and performance of the code 
are presented in Section [5l The findings and future out- 
look are then discussed in Section [6l 

2. BACKGROUND TO GPU COMPUTING 
2.1. Pre-2006: Initial Forays 

The past decade has seen remarkable increases in the 
ability of computers to render complex 3-dimensional 
scenes at movie frame-rates. These gains have been 
achieved by progressively shifting the graphics pipeline 
— the algorithmic sequence of steps that converts a scene 
description into an image — from the CPU to dedicated 
hardware within the GPU. To address the infiexibility 
that can accompany such hardware acceleration, GPU 
vendors introduced so-called programmable shaders, pro- 
cessing units that apply a simple sequence of transforma- 
tions to input elements such as image pixels and mesh 
vertices. NVIDIA Corporation were the first to im- 
plement programmable shader functionality, with their 
GeForce 3 series of GPUs (released March 2001) offer- 
ing one vertex shader and four (parallel) pixel shaders. 
The release in the following year of ATI Corporation's 
R300 series brought not only an increase in the num- 
ber of shaders (up to 4 vertex and 8 pixel) , but also ca- 
pabilities such as floating-point arithmetic and looping 
constructs that laid the foundations for what ultimately 
would become GPU computing. 

Shaders are programmed using a variety of special- 
ized language s, such as th e OpenGL Shading Language 
(GLSL; e.g., iRosil \2QM\ and Microsoft's High-Level 
Shading Language (HLSL). The designs of these lan- 
guages are strongly tied to their graphics-related pur- 
pose, and thus early attempts at GPU computing using 
programmable shaders had to map each calculation into 
a seq uence of equi valent graphical operations (see, e.g., 
.Owens et al.ll2005l and references therein). In an effort 
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to overcome this awkward aspect, iBuck et"an (|2004[ ) de- 
veloped BrookGPU — a compiler and run-time imple- 
mentation of the Brook stream programming language 
for GPU platforms. With BrookGPU, the computational 
resources of shaders are accessed through a stream pro- 
cessing paradigm: a well-defined series of operations (the 
kernel) are applied to each element in a typically-large 
homogeneous sequence of data (the stream). 

2.2. Post-2006: Modern Era 

GPU computing entered its modern era in 2006, with 
the release of NVIDIA's Compute Unified Device Ar- 
chitecture (CUDA) — a framework for defining and 
managing GPU computations without the need to map 
them into graphical operations. C UDA-enabled devices 
(see Appendix A of lNVIDIAll2010 f) are distinguished by 
their general-purpose unified shaders, which replace the 
function-specific shaders (pixel, vertex, etc.) present in 
earlier GPUs. These shaders are programmed using an 
extension to the C language, which follows the same 
stream-processing paradigm pioneered by BrookGPU. 
Since the launch of CUDA, other vendors have been quick 
to develop their own GPU computing offerings, most no- 
tably Advanced Micro Devices (AMD) with their Stream 
framework, and Microsoft with their DirectCompute in- 
terface. 

Abstracting away the graphical roots of GPUs has 
made them accessible to a very broad audience, and 
GPU-based computations are now being undertaken in 
fields as diverse as molecular biology, medical imaging, 
geophysic s, fluid dyn a mics, econom ics and cryptogra- 
phy fsee iPharri [20051 : iNguvenl l2007t ). Within astron- 
omy and astrophysics, recent applications include A'^- 
body simulations (Bcllcman et al. 2008), real-time ra- 
dio correlation (Wavth ct al. 2009), gravitational lens- 
ing ([Thompson et aL2010), adaptive-mesh hydrody- 
namics ( Schivc ct aO2010f) and cosmological reionization 
(jAubert fc TeyssieiMoTof T. 

3. THE LOMB-SCARGLE PERIODOGRAM 

This section reviews the formalism defining the Lomb- 
Scargic periodogram. For a time series comprising Nt 
measurements Xj = X{tj) sampled at times tj [j = 
1, . . . , Nt), assumed throughout to have been scaled and 
shifted such that its mean is zero and its variance is unity, 
the normalized L-S periodogram at frequency / is 
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Here and throughout, uj = 2nf is the angular frequency 
and all summations run from j — 1 to j = Nt- The 
frequency-dependent time offset r is evaluated at each w 
via 



tan 2wr — 
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As discussed bv iSchwarzenberg-Czernvl (|1998[ ). Pn in the 
case of a pure Gaussian-noise time series is drawn from 
a beta distribution. For a periodogram comprising Nf 
frequenciefQ, the false-alarm probability (FAP) — that 
some observed peak occurs due to chance fluctuations — 
is 



Q = l- 



1-1- 



2Pn 

Nt 



(Nt-7i)/2 
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Equations ([T]) and ([2|) can be written schematically as 



Pn(/)-5]e[/,(t„X,)], 



(4) 



where Q is some fun ction. In the classification scheme in- 
troduced by Barsde TTet al.l |2010), this follows the form 
of an interact algorithm. Generally speaking, such al- 
gorithms are well-suited to GPU implementation, since 
they are able to achieve a high arithmetic intensity. How- 
ever, a straightforward implementation of equations ([l} 
and ^ involves two complete runs through the time se- 
ries to calculate a single Pn{f), which is wasteful of mem- 
ory bandwidth and requires Nf{ANt -1-1) costly trigono- 
metric function e valuations for the full periodogram. 
IPress et ahl (|1992f ) address this inefficiency by calculat- 
ing the trig functions from recursion relations, but this 
approach is difficult to map onto stream processing con- 
cepts, and moreover becomes inaccurate in the limit of 
large Nf . An alternative strategy, which avoids these dif- 
ficulties while still offering improved performance, comes 
from refactoring the equations as 
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Here, 




Ct= cos WT, Sr — SmUJT, 


(7) 


while the sums 






XC = SJ Xj cos ujtj , 





XS — 2_, Xj sin ujtj , 

j 
CC = ^cos2a;ij, (8) 

j 
SS = '^sin^ujtj, 

j 
CS — 2, cos ujtj sin ojtj , 

i 

can be evaluated in a single run through the time series, 
giving a total of Nf{2Nt -\- 3) trig evaluations for the full 
periodogram — a factor '^ 2 improvement. 

^ Th e iss ue of 'independent' frequencies is briefly discussed in 
Section |6^ 
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4. culsp: a GPU LOMB-SCARGLE PERIODOGRAM CODE 

4.1. Overview 

This section introduces CULSP, a Lonib-Scargle pe- 
riodogram code implemented within NVIDIA's CUDA 
framework. Below, I provide a brief technical overview 
of CUDA. Section 14.31 then reviews the general design 
of CULSP, and Section IT4l narrates an abridged version 
of the kernel source. The full source, which is freely re- 
distributable under the GNU General Public License, is 
provided in the accompanying on-line materials. 



4.2. The CUDA Framework 

A CUDA-enabled GPU comprises one or more stream- 
ing multiprocessors (SMs), themselves composed of a 
numbeiQ of scalar processors (SPs) that are functionally 
equivalent to processor cores. Together, the SPs allow 
an SM to support concurrent execution of blocks of up 
to 512 threads. Each thread applies the same computa- 
tional kernel to an element of an input stream. Resources 
at a thread's disposal include its own register space; 
built-in integer indices uniquely identifying the thread; 
shared memory accessible by all threads in its parent 
block; and global memory accessible by all threads in all 
blocks. Reading or writing shared memory is typically 
as fast as accessing a register; however, global memory 
is two orders of magnitude slower. 

CUDA programs are written in the C language with ex- 
tensions that allow computational kernels to be defined 
and launched, and the differing types of memory be al- 
located and accessed. A typical program will transfer 
input data from CPU memory to GPU memory; launch 
one or more kernels to process these data; and then copy 
the results back from GPU to CPU. Executables are cre- 
ated using the nvcc compiler from the CUDA software 
development kit (SDK). 

A CUDA kernel has access to the standard C mathe- 
matical functions. In some cases, two versions are avail- 
able ('library' and 'intrinsic'), offering different trade-offs 
betw een precision and speed (see Appendix C of N VIDIAl 
120101) . For the sine and cosine functions, the library ver- 
sions are accurate to within 2 units of last place, but are 
very slow because the range-reduction algorithm — re- 
quired to bring arguments into the (— 7r/4,7r/4) interval 
— spills temporary variables to global memory. The in- 
trinsic versions do not suffer this performance penalty, 
as they are hardware-implemented in two special func- 
tion units (SFUs) attached to each SM. However, they 
become inaccurate as their arguments depart from the 
(— TT, tt) interval. As discussed below, this inaccuracy can 
be remedied through a very simple range-reduction pro- 
cedure. 



4.3. Code Design 

The CULSP code is a straightforward CUDA imple- 
mentation of the L-S periodogram in its refactored form 
(equations [BHl]) . A uniform frequency grid is assumed, 

f., = iAf ii = l,...,Nf), (9) 

■^ Eight, for the GPUs considered in the present work. 



where the frequency spacing and number of frequencies 
are determined from 



and 



A/ = 



Nf 



FovcritNt ~ tl) 



high-^ovcr-^ *i 



(10) 



(11) 



respectively. The user-specified parameters Fovcr and 
^high control the oversampling and extent of the peri- 
odogram; Fovor = 1 gives the characteristic sampling es- 
tablished by the length of the time series, while Fhigh = 1 
gives a maximum frequency equal to the mean Nyquist 
frequency /ny = A^t / [2 (iw^ - i i ) ] . 

The input time series is read from disk and pre- 
processed to have zero mean and unit variance, before 
being copied to GPU global memory. Then, the compu- 
tational kernel is launched for Nf threads arranged into 
blocks of size iVj3; each thread handles the periodogram 
calculation at a single frequency. Once all calculations 
are complete, the periodogram is copied back to CPU 
memory, and from there written to disk. 

The sums in equation ([5]) involve the entire time se- 
ries. To avoid a potential memory-access bottleneck, and 
to improve accuracy, CULSP partitions these sums into 
chunks equal in size to the thread block size A'b. The 
time-series data required to evaluate the sums for a given 
chunk are copied from (slow) global memory into (fast) 
shared memory, with each thread in a block transferring 
a single {tj , Xj) pair. Then, all threads in the block enjoy 
fast access to these data when evaluating their respective 
per-chunk sums. 

4.4. Kernel Source 

Figure [T] lists abridged source for the CULSP computa- 
tional kernel. This is based on the full version supplied 
in the on-line materials, but special-case code (handling 
situations where Nt is not an integer multiple of A'^b) has 
been removed to facilitate the discussion. 

The kernel accepts five arguments (lines [2H3] of the 
listing). The first three are array pointers giving the 
global- memory addresses of the time-series (d_time and 
d_data) and the output periodogram (d_P). The remain- 
ing two give the frequency spacing of the periodogram 
(df ) and the number of points in the time series (N_t). 
The former is used on line [TT] to evaluate the frequency 
from the thread and block indices; the macro BLOCK_SIZE 
is expanded by the pre-processor to the thread block size 

Lines [27H701 construct the sums of equation ([8]) , fol- 
lowing the chunk partitioning approach described above 
(note, however, that the SS sum is not calculated explic- 
itly, but reconstructed from CC on line [75]) . Lines [5T1 - 
1361 are responsible for copying the time-series data 
for a chunk from global memory to shared memory; 

the syncthreadsO instructions force synchronization 

across the whole thread block, to avoid potential race 
conditions. The inner loop (lines ITT1 - I55)) then evaluates 
the per-chunk sums; the \#pragma unroll directive on 

^ Set to 256 throughout the present work; tests indicate that 
larger or smaller values give a slightly reduced performance. 
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__global__ void 

culsp_kernelCf loat +d_t , float *d_X, float *d_P , 

float df , int N_t ) 

■c 

__shared__ float s_t [BLOCK_SIZE] ; 

__shared__ float s_X [BLOCK_SIZE] ; 

// Calculate the frequency 

float f = Cblockldx . x+BLOCK_SIZE+threadIdx. x+l)+df 

// Calculate the various sums 



int j ; 

forCj = 0; j < N_t; j += BLOCK_SIZE) i 

// Load the chunk into shared memory 

__syncthreads ( ) ; 

s_t [threadldx.x] = d_t[j+threadldx.x] ; 
s_X[threadIdx.x] = d_X[j+threadIdx.x]; 

__syncthreads ( ) ; 

// Update the sums 

# pragma unroll 

forCint k = 0; k < BLOCK_SIZE; k++) { 

// Range reduction 

float ft = f *s_t [k] ; 
ft -= rintf (ft) ; 



float c; 
float s; 


__sincosf (TWOPI*ft , «is , fie); 


XC.chunk += s_X[k]*c; 
XS_chunk += s_X[k]*s; 
CC_chunk += c*c; 
CS_chunk += c*s; 


} 


XC += XC_chunk 
XS += XS. chunk 
CC += CC_chunk 
CS += CS_chunk 




XC_chunk = O.f 
XS_chunk = O.f 
CC_chunk = O.f 
CS.chunk = O.f 




} 


float SS = (float) N_t - CC ; 


// Calculate the tau terms 


float ct; 
float St; 


__slncosf (O.Bf*atan2(2.f*CS, CC-SS), ftst , ftct); 


// Calculate P 


d_P[blockIdx.x*BLOCK_S I ZE+ threadldx.x] = 

0.5f*((ct*XC + st*XS)*(ct*XC + st*XS)/ 

(ct*ct*CC + 2*ct*st*CS + st*st*SS) + 
(ct*XS - st*XC)*(ct*XS - st*XC)/ 
(ct*ct*SS - 2*ct*st*CS + st*st*CC)); 


// Finish 


} 



Fig. 1. — Abridged source for the CULSP computation kernel. 



TABLE 1 

Specifications for the two GPUs used in the 
benchmarking. 



GPU 


SMs 


SPs 


Clock (GHz) 


Memory (MB) 


GeForce 8400 GS 
Tesla C1060 


1 
30 


8 
240 


1.4 
1.3 


512 
4096 



line lini instructs the compiler to completely unroll this 
loop, conferring a significant performance increase. 

The sine and cosine terms in the sums are evalu- 
ated simultaneously with a call to CUDA's intrinsic 

sincosf function (line 15 1|). To maintain accuracy, a 

simple range reduction is applied to the phase ft by sub- 
tracting the nearest integer [as calculated using rintf () ; 

line|46]. This brings the argument of sincosf () into 

the interval (— 7r,7r), where its maximum absolute error 
is 2~^^-^^ for sin e and 2^^^-^^ for cosine (see Table C-3 
of lNVIDIAll2010l) . 

5. BENCHMARKING CALCULATIONS 

5.1. Test Configurations 

This section compares the accuracy and performance of 
CULSP against an equivalent CPU-based code. The test 
platform is a Dell Precision 490 workstation, contain- 



ing two Intel 2.33 GHz Xeon E5345 quad-core processors 
and 8 GB of RAM. The workstation also hosts a pair 
of NVIDIA GPUs: a Tesla C1060 populating the single 
PCI Express (PCIe) xl6 slot, and a GeForce 8400 GS 
in the single legacy PCI slot. These devices are broadly 
representative of the opposite ends of the GPU market. 
The 8400 GS is an entry-level product based on the older 
G80 hardware architecture (the first to support CUDA) , 
and contains only a single SM. The C1060 is built on 
the newer GT200 architecture (released 2008/2009), and 
with 30 SMs represents one of the most powerful GPUs 
in NVIDIA's portfolio. The technical specifications of 
each GPU are summarized in Table [TJ 

The CPU code used for comparison is LSP, a straight- 
forward port of CULSP to ISO C99 with a few modifi- 
cations for performance and language compliance. The 
sine and cosine terms are calculated via separate calls 
to the sinf and cosf () functions, since there is no 
sincosf function in standard C99. The argument re- 
duction step uses an integer cast instead of rintf () ; this 
allows the compiler to vectorize the inner loops, greatly 
improving performance while having a negligible impact 
on results. Finally, the outer loop over frequency is triv- 
ially parallelized using an OpenMP directive, so that all 
available CPU cores can be utilized. Source for LSP is 
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provided in the accompanying on-line materials. 

The Precision 490 workstation runs 64-bit Gentoo 
Linux. GPU executables are created with the 3.1 re- 
lease of the CUDA SDK, which relies on GNU gcc 4.4 
as the host-side compiler. CPU executables are created 
with Intel's ice 11.1 compiler, using the -03 and -xHost 
optimization flags. 

5.2. Accuracy 




Fig. 2. — Part the L-S periodogram for V1449 Aql, evaluated 
using the LSP code (thick curve). The thin curve shows the ab- 
solute deviation | pcuLsp _ pisp | Qf ^ j^g corresponding periodogram 
evaluated using the CULSP code. The strong peak corresponds to 
the star's dominant 0.18-d pulsation mode. 



-4- 



-6- 
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logP.r 

Fig. 3. — A scatter plot of the L-S periodogram for V1449 Aql, 
evaluated using the LSP (abscissa) and CULSP (ordinate) codes. 



As the validation dataset for comparing the accuracy 
of CULSP and LSP, I use the 150-day photometric time 
series of the /3 Cephei pulsator V 1449 Aql (HP 1 80642) 
obtained by the CoRoT mission (jBelkacem et al.l [20091. 
The observations comprise 382,003 flux measurements 
(after removal of points flagged as bad), sampled un- 
evenly (in the heliocentric frame) with an average sepa- 
ration of 32 s. 

Figure[2]plots the periodogram of V1449 Aql evaluated 
using LSP, over a frequency interval spanning the star's 
domi nant 0.18 d pulsation mode (see ^aclkcns et al] 
Il998| ). Also shown in the figure is the absolute devi- 
ation jP™'"^'' — P^^^l of the corresponding periodogram 
evaluated using CULSP (running on either GPU — the 
results are identical). The figure confirms that, at least 
over this particular frequency interval, the two codes are 
in good agreement with one another; the relative error is 
on the order of 10~^. 

To explore accuracy over the full frequency range. 
Fig. [3] shows a scatter plot of P^^^ against P^^^^^ . Very 
few of the Nf — 1,528,064 points in this plot de- 
part to any significant degree from the diagonal line 
pLsp ^ pcuLsp^ Those that do are clustered in the Pn < 1 
corner of the plot, and are therefore associated with the 
noise in the light curve rather than any periodic signal. 
Moreover, the maximum absolute difference in the pe- 
riodogram FAPs (equation [S]) across all frequencies is 
4.1 X 10~^, which is negligible. 

5.3. Performance 



t5^ GeForce 8400 GS 
-^A— TeslaClOeO 
-^— CPU(1 thread) 
-•— CPU (8 thread.s) 



J 0- 




3 4 5 

logiV, 

Fig. 4. — Mean calculation times (tcalc) for the L-S periodogram, 
evaluated using the CULSP (triangles) and LSP (circles) codes. The 
dashed line, with slope dlog(t(,alc)/<ilog A^t = 2, indicates the 
asymptotic scaling of the periodogram algorithm. 

Code performance is measured by averaging the 
V1449 Aql periodogram calculation time icaic over five 
executions. These timings exclude the overheads in- 
curred by disk input/output and from rectifying light 
curves to zero mean an unit variance. Table [2] lists the 
mean (icaic) and associated standard deviation CT(tcaic) 
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TABLE 2 
Periodogram calculation times. 



Code Platform 



(tcalc) (s) 0-{tcalc) (s) 



CULSP GeForce 8400 GS 

CULSP Tesla C1060 

LSP CPU (1 thread) 

LSP CPU (8 threads) 



570 


0.0093 


20.3 


0.00024 


4570 


14 


566 


6.9 



for CULSP running on both GPUs, and for LSP running 
with a single OpenMP thread (equivalent to a purely se- 
rial CPU implementation), and with 8 OpenMP threads 
(one per workstation core). 

With just one thread, LSP is significantly outperformed 
by CULSP on either GPU. Scaling up to 8 threads short- 
ens the calculation time by a factor '^ 8, indicating near- 
ideal parallelization; nevertheless, the two CPUs working 
together only just manage to beat the GeForce 8400 GS, 
and are still a factor ^ 28 slower than the Tesla C1060. 
Perhaps surprisingly, the latter ratio is greater than sug- 
gested by comparing the theoretical peak floating-point 
performance of the two platforms — 74.6 GFLOPS (bil- 
lions of floating-point operations per second) for all 8 
CPU cores, versus 936 GFLOPS for the C1060. This 
clearly warrants further investigation. 

Profiling with the GNU gprof tool indicates that the 
major bottleneck in LSP, accounting for 80% of (tcaic), 

is the svml_sincosf 4() function from Intel's Short 

Vector Math Library. This function evaluates four sine/- 
cosine pairs at once by leveraging the SSE2 instructions 
of modern x86-architecture CPUs. Microbenchmarking 

reveals that a svml_sincosf 4() call costs ~ 45.6 clock 

cycles, or ~ 11.4 cycles per sine/cosine pair. In contrast, 
thanks to its two special function units, a GPU SM can 
evaluate a sine /cosine pa ir in a single cycle (see Appendix 
G.3.1 of lNVIDIAl[2??l(il ). Scaling these values by the 
appropriate clock frequencies and processor counts, the 
sine/cosine throughput on all 8 CPU cores is 1.6 billion 
operations per second (COPS), whereas on the 30 SMs 
of the C1060 it is 39 COPS, around 23 times faster. Of 

course, it should be recalled that the GPU sincosf () 

functio n op erates at a somewhat-reduced precision (see 
Section Hm. In principle, the CPU throughput could be 
improved by developing a similar reduced-precision func- 
tion to replace svml_sincosf4(). However, it seems 

unlikely that a software routine could ever approach the 
throughput of the dedicated hardware in the SFUs. 

The disparity between sine/cosine throughput ac- 
counts for most of the factor ^ 28 performance difference 
between CULSP and LSP, noted above. The remainder 
comes from the ability of an SM to execute instructions 
simultaneously on its SFUs and scalar processors. That 
is, the sine/cosine evaluations can be undertaken in par- 
allel with the other arithmetic operations involved in the 
periodogram calculation. 

Looking now at the memory performance of CULSP, 
NVIDIA's cudaprof profiling tool indicates that almost 
all reads from global memory are coalesced, and that no 
bank conflicts arise when reading from shared memory. 
Thus, the GPU memory accesses patterns can be consid- 
ered close to optimal. The combined time spent copying 
data from CPU to GPU and vice versa is ~ 6 ms on the 
C1060, and ~ 29 ms on the 8400 GS; while these val- 
ues clearly reflect the bandwidth difference between the 



PCIe and PCI slots hosting the GPUs, neither makes any 
appreciable contribution to the execution times listed in 
Table H 

To round off the present discussion, I explore how 
CULSP and LSP perform with different-sized datasets. 
The analysis in Section |3] indicates a periodogram work- 
load scaling as 0{NfNt). With the number of frequen- 
cies following Nf oc Nt (equation [TT|) . icaic should there- 
fore scale proportionally with Nf — as in fact already 
claimed in Introduction. To test this expectation. Fig. |4] 
shows a log- log plot of (tcaic) as a function of Nt, for the 
same configurations as in Table [2j The light curve for a 
given Nt is generated from the full V1449 Aql light curve 
by uniform down-sampling. 

In the limit of large Nt, all curves asymptote toward 
a slope dlog(tcaic)/dlog7Vt = 2, confirming the hypoth- 
esized Nf scaling. At smaller Nt, departures from this 
scaling arise from computational overheads that are not 
directly associated with the actual periodogram calcula- 
tion. These are most clearly seen in the LSP curve for 8 
threads, which approaches a constant log(icaic) ~ —1.5 
independent of Nt — perhaps due to memory cache con- 
tention between the different threads. 

6. DISCUSSION 
6.1. Cost/Benefit Analysis 

To establish a practical context for the results of the 
preceding sections, I briefly examine the price vs. per- 
formance of the CPU and GPU platforms. At the time 
of writing, the manufacturer's bulk (1,000-unit) pricing 
for a pair of Xeon E5345 CPUs is 2 x $455, while a 
Tesla C1060 retails for around $1,300 and a GeForce 8400 
GS for around $50. First considering the C1060, the 
50% greater cost of this device (relative to the CPUs) 
brings almost a factor thirty reduction in periodogram 
calculation time — an impressive degree of leveraging. 
However, its hefty price tag together with demanding 
infrastructure requirements (dedicated PCIe power con- 
nectors, supplying up to 200 W), means that it may not 
be the ideal GPU choice in all situations. 

The 8400 GS offers a similar return-on-investment at 
a much-more affordable price: almost the same perfor- 
mance as the two CPUs at one-twentieth of the cost. 
This heralds the possibility of budget GPU computing, 
where a low-end desktop computer is paired with an 
entry-level GPU, to give performance exceeding high- 
end, multi-core workstations for a price tag of just a few 
hundred dollars. Indeed, many desktop computers to- 
day, or even laptops, are already well equipped to serve 
in this capacity. 

6.2. Applications 

An immediate application of CULSP is analysis of the 
photometric time series o btained by o i igoing satellite 
missions such as MOST (iWalker et all [200l . CoRoT 
(|Auvergnd [2009I ). and Kevler (iKochTlmol ). These 
datasets are typically very large {Nt > 10^), lead- 
ing to a significant per-star cost for calculating a pe- 
riodogram. When this cost is multiplied by the num- 
ber of targets being monitored (in the cast of Ke- 
pler, again > 10^), the overall computational bur- 
den becomes very steep. Looking into the near 
future, similar issues will be faced with ground- 
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based time-d omain facilities such as Pan-STARRS 
( Kaiseiil2002D and the Large Synoptic Survey Te l escop e 



( LSST Science Collaborations and LSST Proiectl [2009I) . 
It is hoped that CULSP, or an extension to other related 
periodograms (see below), will help resolve these chal- 
lenges. 

An additional application of CULSP is in the interpreta- 
tion of periodograms. Equation ([3]) presumes that the P^ 
at each frequency in the periodogram is independent of 
the others. This is not necessarily the case, and the expo- 
nent in the equation should formally be replaced by some 
number Nf_\ „d representing the numb er of independent 
frequencies. iHorne fc Baliunai ()1986D pioneered the use 
of simulations to estimate iVjjnd empirically, and similar 
Monte- Carlo techniques have since been applied to ex- 
plore the s tatistical properties o f the L-S periodogram in 
detail (see 'Frescura et ahl 120081 and references therein). 
The bottleneck in these simulations are the many peri- 
odogram evaluations, making them strong candidates for 
GPU acceleration. 



iZechmeister fc Kiirsteii ()2009l ) generalize the L-S peri- 
odogram to allow for the fact that the time-series mean is 
typically not known a priori, but instead estimated from 
the data themselves. The expressions derived by these 
authors involve sums having very similar forms to equa- 
tion ([H); thus, it should prove trivial to develop GPU 
implementations of the generaliz ed periodograms. The 
multi-harmonic periodogram of iSchwarzenber g- Czernvl 
(|1996[) and the SigSpec method of JReegenl ()2007[ ) also ap-^ 
pear promising candidates for implementation on CPUs, 
although algorithmically they are rather more-complex. 
Looking at the bigger picture, while the astronomical 
theory and modeling communities have been quick to rec- 
ognize the usefulness of CPUs (see Section [T]), progress 
has been more gradual in the observational community; 
radio correlation is the only significant application to 
date (Wav th et al.l l2009l. It is my hope that the present 
paper will help illustrate the powerful data-analysis ca- 
pabilities of CPUs, and demonstrate strategies for using 
these devices effectively. 



6.3. Future Work 

Recognizing the drawbacks of being wedded to one 
particular hardware/software vendor, a strategically im- 
portant future project will be to port CULSP to Open 
CL (Open Computing Language) — a recently devel- 
oped standard for programming devices such as multi- 
core CPU s and CPUs in a platform-neutral manner 
(see, e.g., iStone et"al] I2010D . There is also consider- 
able scope for applying the lesson s learne d here in to 
other spectral analysis techniques. IShrageri (J2001D and 
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